AITopics | image and sentence

Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

Neural Information Processing SystemsSep-30-2025, 10:57:54 GMT

We introduce a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data. Unlike previous models that directly map images or sentences into a common embedding space, our model works on a finer level and embeds fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space. We then introduce a structured max-margin objective that allows our model to explicitly associate these fragments across modalities. Extensive experimental evaluation shows that reasoning on both the global level of images and sentences and the finer level of their respective fragments improves performance on image-sentence retrieval tasks. Additionally, our model provides interpretable predictions for the image-sentence retrieval task since the inferred inter-modal alignment of fragments is explicit.

bidirectional image sentence mapping, deep fragment embedding, name change, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.42)

Add feedback

Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

Andrej Karpathy, Armand Joulin, Li F. Fei-Fei

Neural Information Processing SystemsFeb-9-2025, 10:57:33 GMT

We introduce a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data. Unlike previous models that directly map images or sentences into a common embedding space, our model works on a finer level and embeds fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space. We then introduce a structured max-margin objective that allows our model to explicitly associate these fragments across modalities. Extensive experimental evaluation shows that reasoning on both the global level of images and sentences and the finer level of their respective fragments improves performance on image-sentence retrieval tasks. Additionally, our model provides interpretable predictions for the image-sentence retrieval task since the inferred inter-modal alignment of fragments is explicit.

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Santa Clara County > Stanford (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)
Asia > Middle East > Jordan (0.04)

Industry: Leisure & Entertainment > Sports (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

Neural Information Processing SystemsJan-18-2025, 01:35:35 GMT

We introduce a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data. Unlike previous models that directly map images or sentences into a common embedding space, our model works on a finer level and embeds fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space. We then introduce a structured max-margin objective that allows our model to explicitly associate these fragments across modalities. Extensive experimental evaluation shows that reasoning on both the global level of images and sentences and the finer level of their respective fragments improves performance on image-sentence retrieval tasks. Additionally, our model provides interpretable predictions for the image-sentence retrieval task since the inferred inter-modal alignment of fragments is explicit.

bidirectional image sentence mapping, deep fragment embedding, image-sentence retrieval task, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.47)

Add feedback

Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

Neural Information Processing SystemsMar-13-2024, 10:13:04 GMT

We introduce a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data. Unlike previous models that directly map images or sentences into a common embedding space, our model works on a finer level and embeds fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space. We then introduce a structured max-margin objective that allows our model to explicitly associate these fragments across modalities. Extensive experimental evaluation shows that reasoning on both the global level of images and sentences and the finer level of their respective fragments improves performance on image-sentence retrieval tasks. Additionally, our model provides interpretable predictions for the image-sentence retrieval task since the inferred inter-modal alignment of fragments is explicit.

fragment, image and sentence, representation, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Santa Clara County > Stanford (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)
Asia > Middle East > Jordan (0.04)

Industry: Leisure & Entertainment > Sports (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

WisdoM: Improving Multimodal Sentiment Analysis by Fusing Contextual World Knowledge

Wang, Wenbin, Ding, Liang, Shen, Li, Luo, Yong, Hu, Han, Tao, Dacheng

arXiv.org Artificial IntelligenceJan-12-2024

Sentiment analysis is rapidly advancing by utilizing various data modalities (e.g., text, image). However, most previous works relied on superficial information, neglecting the incorporation of contextual world knowledge (e.g., background information derived from but beyond the given image and text pairs) and thereby restricting their ability to achieve better multimodal sentiment analysis. In this paper, we proposed a plug-in framework named WisdoM, designed to leverage contextual world knowledge induced from the large vision-language models (LVLMs) for enhanced multimodal sentiment analysis. WisdoM utilizes a LVLM to comprehensively analyze both images and corresponding sentences, simultaneously generating pertinent context. To reduce the noise in the context, we also introduce a training-free Contextual Fusion mechanism. Experimental results across diverse granularities of multimodal sentiment analysis tasks consistently demonstrate that our approach has substantial improvements (brings an average +1.89 F1 score among five advanced methods) over several state-of-the-art methods. Code will be released.

knowledge, sentiment analysis, world knowledge, (13 more...)

arXiv.org Artificial Intelligence

2401.06659

Country:

Asia > Middle East > Syria > Aleppo Governorate > Aleppo (0.06)
North America > United States > Wisconsin (0.05)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment > Sports (0.93)
Health & Medicine (0.69)
Media (0.68)
Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

An Unsupervised Sampling Approach for Image-Sentence Matching Using Document-Level Structural Information

Li, Zejun, Wei, Zhongyu, Fan, Zhihao, Shan, Haijun, Huang, Xuanjing

arXiv.org Artificial IntelligenceMar-21-2021

In this paper, we focus on the problem of unsupervised image-sentence matching. Existing research explores to utilize document-level structural information to sample positive and negative instances for model training. Although the approach achieves positive results, it introduces a sampling bias and fails to distinguish instances with high semantic similarity. To alleviate the bias, we propose a new sampling strategy to select additional intra-document image-sentence pairs as positive or negative samples. Furthermore, to recognize the complex pattern in intra-document samples, we propose a Transformer based model to capture fine-grained features and implicitly construct a graph for each document, where concepts in a document are introduced to bridge the representation learning of images and sentences in the context of a document. Experimental results show the effectiveness of our approach to alleviate the bias and learn well-aligned multimodal representations.

image-sentence pair, representation, similarity, (16 more...)

arXiv.org Artificial Intelligence

2104.02605

Country: Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

Karpathy, Andrej, Joulin, Armand, Fei-Fei, Li F.

Neural Information Processing SystemsFeb-14-2020, 08:56:53 GMT

We introduce a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data. Unlike previous models that directly map images or sentences into a common embedding space, our model works on a finer level and embeds fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space. We then introduce a structured max-margin objective that allows our model to explicitly associate these fragments across modalities. Extensive experimental evaluation shows that reasoning on both the global level of images and sentences and the finer level of their respective fragments improves performance on image-sentence retrieval tasks. Additionally, our model provides interpretable predictions for the image-sentence retrieval task since the inferred inter-modal alignment of fragments is explicit.

artificial intelligence, bidirectional image sentence mapping, deep fragment embedding, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.53)

Add feedback

Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

Karpathy, Andrej, Joulin, Armand, Fei-Fei, Li F.

Neural Information Processing SystemsDec-31-2014

We introduce a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data. Unlike previous models that directly map images or sentences into a common embedding space, our model works on a finer level and embeds fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space. We then introduce a structured max-margin objective that allows our model to explicitly associate these fragments across modalities. Extensive experimental evaluation shows that reasoning on both the global level of images and sentences and the finer level of their respective fragments improves performance on image-sentence retrieval tasks. Additionally, our model provides interpretable predictions for the image-sentence retrieval task since the inferred inter-modal alignment of fragments is explicit.

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country: